Skip to content

feat(cloudtrail-alerts): per-detector exclusions + 8 new detectors + SSO MFA fix#277

Open
Cre-eD wants to merge 4 commits into
mainfrom
feat/cloudtrail-alerts-exclusions-and-new-detectors
Open

feat(cloudtrail-alerts): per-detector exclusions + 8 new detectors + SSO MFA fix#277
Cre-eD wants to merge 4 commits into
mainfrom
feat/cloudtrail-alerts-exclusions-and-new-detectors

Conversation

@Cre-eD
Copy link
Copy Markdown
Contributor

@Cre-eD Cre-eD commented May 19, 2026

Why

Production deployment in everworker exposed the alert-fatigue profile the plugin produces today: ~16 fires/day, ~89% false positives. Over 14 days of CloudWatch alarm history:

Detector Fires Events Source
unauthorizedApiCalls 87 1,829 1,205 AWS service self-probes (PutObject from s3.amazonaws.com, StartQuery from cloudtrail.amazonaws.com, BatchGetImage from lambda.amazonaws.com), 439 prowler-readonly, 126 AWSServiceRoleFor* (ResourceExplorer, Config), 14 Drata. 399 anonymous Azure-IP probes for GetFunctionUrlConfig (only real signal in the bucket).
iamPolicyChanges 54 999 100% integrail-deployer-bot (Pulumi CI from Hetzner runner IPs)
securityGroupChanges 51 291 100% integrail-deployer-bot (Pulumi CI deploy churn)
s3BucketPolicyChanges 30 151 94% integrail-deployer-bot + 6% github-actions-pulumi-github
consoleLoginWithoutMfa 8 8 AWS Identity Center sessions by 2 admins — well-documented CIS CloudWatch.3 false positive (MFAUsed=No because MFA is enforced upstream at the IdP)
9 others (root, tampering, kms, config, vpc, nacl, gateway, route, failed-login) 0 0 silent — built-in detectors offered no knobs to tune the noisy ones

The built-in detectors matched the CIS CloudWatch.1-14 set verbatim and exposed only bool toggles, so the only tuning options were "leave the noise on" or "disable a CIS control entirely." This PR adds knobs.

What

1. Per-detector overrides

New CloudTrailAlertOverride struct, keyed by detector name in selectors.overrides:

alerts:
  iamPolicyChanges: true
  overrides:
    iamPolicyChanges:
      excludeUserNames: ["integrail-deployer-bot", "github-actions-pulumi-github"]
    unauthorizedApiCalls:
      threshold: 10
      excludeUserTypes: ["AWSService"]
      excludeUserNames: ["prowler-readonly"]
      excludeInvokedBy: ["s3.amazonaws.com", "lambda.amazonaws.com"]
      excludeUserArnGlobs: ["arn:aws:sts::*:assumed-role/AWSServiceRoleFor*/*"]
    consoleLoginWithoutMfa:
      excludeUserArnGlobs: ["arn:aws:sts::*:assumed-role/AWSReservedSSO_*"]  # belt-and-braces

Exclusion clauses are baked into the CloudWatch metric filter pattern at provision time (suppression at source — excluded events never increment the metric, the alarm never trips, the Lambda is never invoked, the CW alarm dashboard reflects real signal). Threshold / period / evaluationPeriods overrides are exposed too.

Architecture choice rationale (cross-checked with codex + gemini): source-side suppression over runtime suppression because:

  • CW alarm dashboards stay clean under source-side — can't distinguish suppressed-FP from real-breach at a glance under runtime suppression.
  • Auditors (SOC2 CC6.1) want documented, peer-reviewed exception lists in YAML, not opaque env vars on a Lambda.
  • Runtime suppression still costs CW metric storage + Lambda invocations.

2. SSO MFA false positive fix

consoleLoginWithoutMfa is now scoped to userIdentity.type = "IAMUser". AWS Identity Center / federated console sessions always emit ConsoleLogin with additionalEventData.MFAUsed = "No" because MFA is enforced at the IdP, not at the AWS console step. Without this scope every SSO admin login triggered a CloudWatch.3 alert. Identity Center coverage belongs in a separate detector against signin.amazonaws.com / UserAuthentication events (not added here — different signal, different threshold model).

References:

3. Eight beyond-CIS detectors

Default off (opt-in). Cover attacker-blinding moves and exposure paths CIS CloudWatch.1-14 does not address:

Selector Catches Why CIS misses it
guardDutyDisabled DeleteDetector, UpdateDetector, Disassociate/Delete/StopMonitoringMembers CIS.9 only covers AWS Config recorder
securityHubDisabled DisableSecurityHub, BatchDisableStandards, DisableImportFindingsForProduct, DeleteActionTarget, DeleteInsight Same blinding-move category
accessKeyCreation CreateAccessKey Long-lived creds are higher-risk than STS; rotation tracking
s3PublicAccessChanges Put/DeleteAccountPublicAccessBlock, Put/DeleteBucketPublicAccessBlock CIS.8 only covers bucket policy edits; BPA is the higher-leverage gate
lambdaUrlPublic CreateFunctionUrlConfig / UpdateFunctionUrlConfig with authType = "NONE" Single click of misconfiguration exposes a function's IAM role as a public HTTPS endpoint
kmsKeyPolicyChanges PutKeyPolicy, PutResourcePolicy, CreateGrant, Retire/RevokeGrant CIS.7 only covers key deletion; grants quietly hand decrypt rights
organizationsChanges SCP CRUD + Enable/DisablePolicyType + LeaveOrganization + RemoveAccountFromOrganization SCPs are the strongest preventative control in a multi-account org
anonymousProbes userIdentity.type=AWSAccount AccessDenied probes, threshold 10/5min The 399 GetFunctionUrlConfig probes in the production sample were buried in the noise of generic CIS.2

Backwards compatibility

  • Plain-bool selector form is unchanged. Existing consumers see identical filter patterns + alarm parameters.
  • New overrides map is optional; zero-value CloudTrailAlertOverride is a no-op.
  • New 8 detectors are default off — no consumer gains alerts on plugin upgrade.

Tests

Existing tests updated for totalDetectors = 22. 9 new test cases:

  • TestApplyOverride_Exclusions — single exclusion field
  • TestApplyOverride_MultipleExclusionFields — all 6 exclusion fields together
  • TestApplyOverride_Deterministic — YAML re-ordering produces identical filter pattern (no Pulumi churn)
  • TestApplyOverride_ThresholdAndPeriod — non-zero overrides applied
  • TestApplyOverride_EmptyOverrideIsNoop — zero value is no-op
  • TestApplyOverride_EmptyStringsSkipped["", "real-bot", ""] produces one clause, not three
  • TestEnabledAlerts_OverrideApplied — overrides reach enabledAlerts
  • TestConsoleLoginWithoutMfa_RestrictedToIAMUser — locks down the SSO FP fix
  • TestAnonymousProbes_DefaultThreshold — locks down the threshold=10 default
ok  	github.com/simple-container-com/api/pkg/clouds/aws	0.098s
ok  	github.com/simple-container-com/api/pkg/clouds/aws/helpers	0.103s
ok  	github.com/simple-container-com/api/pkg/clouds/pulumi/aws	0.222s

Test plan

  • go test ./pkg/clouds/aws/... ./pkg/clouds/pulumi/aws/... passes
  • go vet ./pkg/clouds/aws/... ./pkg/clouds/pulumi/aws/... clean
  • Downstream integrail/devops PR uses the new fields against a live preview environment; verify alarms tripping → metric filter no longer matches Pulumi CI events → alarm history shows the gap
  • After SC release, devops PR applies exclusions in everworker prod; re-measure 14d fire rate, target ≤2/week

…SSO MFA fix

Why: production deployment exposed alert-fatigue patterns — ~16 fires/day with
~89% false positives from governed automation (Pulumi CI bot, Prowler scanner,
AWS Identity Center sessions, AWS service-linked roles). Built-in detectors
matched the CIS CloudWatch.1-14 set verbatim, leaving no knob to tune.

Plugin extensions:

- `CloudTrailAlertOverride` schema with optional per-detector exclusions
  (userName / principalId / arn / arn-glob / userType / invokedBy) baked into
  the CloudWatch metric filter pattern at provision time. Suppression happens
  at the metric layer rather than in the Lambda enrichment step so excluded
  events never increment the metric, the alarm never trips, and the CW alarm
  dashboard reflects real signal (preserves SOC2 CC6.1 / ISO 27001 audit
  clarity). Threshold / period / evaluationPeriods overrides too.
- `consoleLoginWithoutMfa` now scoped to `userIdentity.type = "IAMUser"`.
  AWS Identity Center / federated console sessions always emit
  `MFAUsed = "No"` because MFA happens upstream at the IdP — without the
  scope, every SSO console session triggered a CloudWatch.3 false positive.
  Identity Center coverage belongs in a separate detector against
  `signin.amazonaws.com / UserAuthentication` (not added here).
- 8 beyond-CIS detectors covering attacker-blinding moves and exposure paths
  that CIS does not address: GuardDuty disable, Security Hub disable, IAM
  access key creation, S3 Block Public Access toggle, Lambda Function URL
  with AuthType=NONE, KMS key policy / grant changes, AWS Organizations SCP
  churn, anonymous external probes (`userIdentity.type=AWSAccount` AccessDenied
  with default threshold=10 to require sustained reconnaissance, not page on
  one-off probes).

Backwards compat: plain-`bool` selector form is unchanged. Existing consumers
who declare `iamPolicyChanges: true` see identical filter patterns and alarm
parameters. New struct fields and `overrides` map are opt-in.

Tests: 14 new test cases covering exclusion clause generation, deterministic
ordering across YAML re-orderings, threshold / period override application,
empty-string skipping, and SSO MFA scope. Existing tests updated for
totalDetectors = 22 (14 CIS + 8 additions).

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
@Cre-eD Cre-eD requested a review from smecsia as a code owner May 19, 2026 17:07
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 19, 2026

Semgrep Scan Results

Repository: api | Commit: 47dc5d6

Check Status Details
⚠️ Semgrep Warning 10 warning(s), 10 total

Scanned at 2026-05-20 11:43 UTC

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 19, 2026

Security Scan Results

Repository: api | Commit: 47dc5d6

Check Status Details
✅ Secret Scan Pass No secrets detected
✅ Dependencies (Trivy) Pass 2 total (no critical/high)
✅ Dependencies (Grype) Pass 2 total (no critical/high)
📦 SBOM Generated 509 components (CycloneDX)

Scanned at 2026-05-20 11:44 UTC

Cross-model review of PR #277 (codex + gemini, both fact-checked against
AWS primary docs) surfaced 6 correctness bugs that would have shipped to
prod. Each is fixed here with a test that asserts the desired behavior.

1. NOT EXISTS guard on every exclusion clause.

   CloudWatch metric-filter `$.field != "x"` returns FALSE when the field
   is absent, not TRUE. Without a NOT EXISTS guard, an unguarded exclusion
   like `$.userIdentity.userName != "bot"` would silently drop every event
   whose userIdentity lacks the field at the top level — AssumedRole events
   in particular carry userName at $.userIdentity.sessionContext.sessionIssuer
   .userName, NOT at $.userIdentity.userName. The unguarded form would have
   dropped the ENTIRE detector for every assumed-role principal, not just
   the bot we meant to exclude.

   buildExclusionClauses now generates:
     (($.field NOT EXISTS) || ($.field != "v1"))                 # single value
     (($.field NOT EXISTS) || (($.field != "v1") && ($.field != "v2")))  # multi-value

   Multi-value uses inner-AND (De Morgan'd from "NOT (v1 OR v2)"). New tests:
   TestApplyOverride_Exclusions, TestApplyOverride_NotExistsGuard_MultipleValues,
   TestApplyOverride_MultipleExclusionFields, TestApplyOverride_DeDupesValues.

   Ref: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/FilterAndPatternSyntax.html
   Ref: https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-event-reference-user-identity.html

2. Remove PutResourcePolicy from kmsKeyPolicyChanges.

   PutResourcePolicy is not a KMS API. (It exists on CloudTrail Lake and
   a few other services, but never under eventSource=kms.amazonaws.com,
   so the clause was permanent dead code.) KMS uses PutKeyPolicy for
   resource policy edits. Comment updated to record why this is excluded
   to prevent re-introduction.

   Ref: https://docs.aws.amazon.com/kms/latest/APIReference/API_Operations.html

3. Expand organizationsChanges to cover OU + delegated admin moves.

   Added MoveAccount, RemoveAccountFromOrganization, RegisterDelegatedAdministrator,
   DeregisterDelegatedAdministrator, EnableAWSServiceAccess, DisableAWSServiceAccess.
   RegisterDelegatedAdministrator is the canonical "blind the management account"
   move: delegate GuardDuty/Security Hub admin to a compromised member, then
   suppress findings there.

   Ref: https://docs.aws.amazon.com/organizations/latest/APIReference/API_Operations.html

4. Add Root to consoleLoginWithoutMfa scope.

   Earlier fix scoped to userIdentity.type=IAMUser to silence the AWS Identity
   Center false positive. That silently dropped Root console logins (Root type
   is "Root", not "IAMUser"). rootAccountUsage detector catches ALL Root activity
   including MFA-protected sessions; this one specifically surfaces Root-without-MFA
   for triage. Scope expanded to ($.userIdentity.type = "IAMUser" || = "Root").

5. Remove DeleteInsight from securityHubDisabled.

   Insights are saved dashboard searches, not detectors. Deleting one is a UI/
   housekeeping action, not a "visibility lost" event — pure noise.

6. Fail-fast validation of overrides keys + reflection test for wireup.

   - New `validateOverrides()` checked at provision time: rejects map keys
     that don't correspond to any detector. Catches YAML typos like
     `overrides: { unauthorizedApiCall: ... }` (missing trailing s) at deploy
     time with a list of valid keys, instead of silently dropping the override
     and leaving the operator wondering why their exclusion didn't take.
   - selectorChecks extracted to a single source of truth so reflection test
     can assert bidirectional consistency: every selector bool maps to a
     securityAlerts entry and vice versa, count == totalDetectors. Catches
     the regression where a contributor adds a detector but forgets the wireup.

   New tests: TestValidateOverrides_Empty, TestValidateOverrides_KnownKey,
   TestValidateOverrides_UnknownKeyIsLoudError, TestSelectorChecksWireUpAllDetectors,
   TestApplyOverride_WorstCaseBasePattern (covers leading whitespace + internal
   OR-groups in base filter pattern under the trim-and-rewrap path).

Test summary: 17 tests now pass (was 14 in the first push of PR #277).

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
@Cre-eD
Copy link
Copy Markdown
Contributor Author

Cre-eD commented May 19, 2026

Round 2: review-driven correctness fixes

Followup commit 4cc1a03 addresses 6 findings surfaced by a parallel codex + gemini review (both fact-checked against AWS primary docs). Triage table below — wontfix items are deferred follow-ups, not unaddressed concerns.

Fixed in this PR

# Finding Source Fix
1 != "value" returns FALSE on missing field → unguarded exclusions silently dropped every AssumedRole event from the detector (userName lives at $.userIdentity.sessionContext.sessionIssuer.userName, not at top level). codex (fact-checked) buildExclusionClauses now wraps each clause as (($.field NOT EXISTS) || inner). Multi-value uses De Morgan'd inner-AND. 4 new tests including AssumedRole semantics.
2 PutResourcePolicy is not a KMS API — dead clause. codex + gemini Removed. Comment records why to prevent re-introduction.
3 organizationsChanges missing MoveAccount, RegisterDelegatedAdministrator (canonical "blind the management account" move), DeregisterDelegatedAdministrator, Enable/DisableAWSServiceAccess. codex + gemini Added.
4 consoleLoginWithoutMfa IAMUser-only scope silently dropped Root coverage (Root type = "Root", not "IAMUser"). gemini Scope expanded to IAMUser || Root.
5 securityHubDisabled included DeleteInsight (saved dashboard search, not detector blinding) — pure noise. gemini Removed.
6 YAML typos in overrides: map were silently ignored. Reflection gap: new detector could be added without wiring through enabledAlerts. codex + gemini validateOverrides() rejects unknown keys at provision time with (known: [...]) hint. selectorChecks extracted as single source of truth; reflection test asserts bidirectional consistency with securityAlerts map.

Wontfix (filed in head — recorded as follow-ups, not blockers)

Finding Source Why deferred
"CI bot exclusion is a security bypass" codex/gemini Real concern but the alert layer is the wrong defense. Canonical fix: SCP / permission boundary on the bot principal. No usable "expected event names" subset exists because the bot legitimately uses the full IAM/SG/S3 event set during deploys. Recorded as a KNOWN TRADE-OFF block in the devops PR.
"lambdaUrlPublic overstates exposure — AuthType=NONE not public without resource policy" codex Technically true but >95% of AuthType=NONE Function URLs in practice are public (AWS console auto-adds the policy). Detector still useful as a strong signal. lambda:AddPermission with Principal: * is a separate detector to add later.
"anonymousProbes will misfire on legitimate cross-account access from partners" codex No partner integrations documented. If one appears, override schema already supports excludePrincipalIds.
"GuardDuty UpdateDetector is broad" codex Constraining to requestParameters.enable=false adds filter complexity for marginal benefit. Follow-up if it proves noisy in prod.
"Period validation missing" codex Pulumi apply errors loudly on invalid period. Nice-to-have.
"Unauthenticated userIdentity.type" gemini Not in AWS documented type values (Root, IAMUser, AssumedRole, FederatedUser, AWSAccount, AWSService, IdentityCenterUser, Unknown). Skipped as unverified.

Tests

17 tests pass (was 14). go vet clean. Build green.

ok      github.com/simple-container-com/api/pkg/clouds/aws         0.133s
ok      github.com/simple-container-com/api/pkg/clouds/aws/helpers 0.139s
ok      github.com/simple-container-com/api/pkg/clouds/pulumi/aws  0.150s

Cre-eD added a commit that referenced this pull request May 20, 2026
…IEW_BRANCH (+SHA pin) (#278)

## Why

Production users testing a feature-branch SC build via `sc.sh` are
blocked by the Phase 2c failgate: the cert-identity regex passed to
`cosign verify-blob` is hard-pinned to `push.yaml@refs/heads/main`.
Preview tarballs published by `branch-preview.yaml` are legitimately
Sigstore-signed but cannot install. The only workarounds today are `curl
+ tar -xz` (bypasses signature verification entirely) or fork sc.sh
locally. Both are worse than a documented, narrowly-scoped opt-in.

Discovered while validating the CloudTrail security alerts plugin (PR
#277) against a preview build of that branch.

## What

### `sc.sh` — branch-pinned (and optionally SHA-pinned) opt-in

```bash
# Minimum (branch-pinned):
SIMPLE_CONTAINER_TRUST_PREVIEW_BRANCH=feat/your-branch \
SIMPLE_CONTAINER_VERSION=YYYY.M.D-pre.<sha>-preview.<sha> \
  bash <(curl -Ls https://dist.simple-container.com/sc.sh)

# Recommended for CI (branch + commit SHA pin):
SIMPLE_CONTAINER_TRUST_PREVIEW_BRANCH=feat/your-branch \
SIMPLE_CONTAINER_TRUST_PREVIEW_SHA=4cc1a03ca5c259a428e07d4f0bb8eb9120a6e2b7 \
SIMPLE_CONTAINER_VERSION=YYYY.M.D-pre.<sha>-preview.<sha> \
  bash <(curl -Ls https://dist.simple-container.com/sc.sh)
```

`verify_sc_tarball` widens the accepted identity regex to ALSO accept
`branch-preview.yaml@refs/heads/<that exact branch>`. The branch name is
allowlist-validated (`^[A-Za-z0-9._/-]+$`, no `..` /
leading-trailing-slash / `.lock`-suffix) and the `.` regex metachar is
escaped before interpolation, so a value like `evil.*` or `feat.foo`
cannot re-open the permissive trust set.

When `_SHA` is set, cosign is given `--certificate-github-workflow-sha`
which verifies against the Sigstore certificate's GitHub OIDC
`workflow_sha` claim (OID `1.3.6.1.4.1.57264.1.3`) — pinning
verification to a specific commit, not the mutable branch head.

### `sc.sh` — rejects the over-permissive `=1` form outright

The prior commit had `SIMPLE_CONTAINER_ALLOW_PREVIEW=1` (any branch).
Codex + Gemini both flagged that as a meaningful trust expansion —
anyone with push access to any branch in the repo could ship a tarball
that the regex would accept once a user opted in for "their" branch. Now
`=1` fails closed with a pointer to `_BRANCH`.

### `sc.sh` — loud stderr warning + smarter error messages

- Every preview install emits a multi-line stderr warning listing the
trusted branch + whether SHA pin is in effect. Mitigates the "persistent
export in shell rc silently relaxes trust" footgun.
- On verification failure in strict mode, the error message detects the
preview-signed case, parses the signer branch out of cosign's `got
subjects [...]` line, and offers it back as a copy-paste env-var value.

### `docs/SECURITY.md` — documentation

New subsection under "Verifying tarballs" documenting:
- `SIMPLE_CONTAINER_TRUST_PREVIEW_BRANCH` +
`SIMPLE_CONTAINER_TRUST_PREVIEW_SHA` usage
- Why "trust all preview" is intentionally not supported
- Manual `cosign verify-blob` equivalent for preview tarballs (branch +
SHA pin)

## Security analysis

Threats addressed:

| Threat | Fix |
|---|---|
| Any-branch trust | `_BRANCH` required; `=1` rejected; branch allowlist
+ regex escape |
| Workflow file content not pinned | Optional `_SHA` via
`--certificate-github-workflow-sha` (verifies OID 1.3.6.1.4.1.57264.1.3)
|
| Persistent env var = ambient policy | Loud stderr warning on every
preview install; `_BRANCH` value must change per preview, narrowing the
persistence window |
| CDN-account compromise serves different-branch tarball | Branch pin
neutralizes — different branch's tarball at the requested URL fails
verification |
| CDN serves old commit of the same branch | SHA pin (recommended)
neutralizes |
| Regex injection / identity shadowing | Branch allowlist + escape of
`.` metachar |

Threats explicitly out of scope (separate concerns, documented in commit
message):
- Same CDN namespace (publish side, not verify side)
- Installed `sc --version` vs requested interlock (defense-in-depth,
separate hardening pass)
- Path-traversal-safe extraction (pre-existing concern in extract logic)

## Testing

8 manual test cases against the live preview tarball
`v2026.5.26-pre.4cc1a03-preview.4cc1a03`:

| # | Setup | Expected | Result |
|---|---|---|---|
| A | no env vars | strict reject + auto-extracted hint | ✅ |
| B | `ALLOW_PREVIEW=1` | fail-fast before cosign | ✅ |
| C | `_BRANCH=evil.*` | validation reject | ✅ |
| D | `_BRANCH=feat/wrong` (valid format, doesn't match signer) | cosign
mismatch | ✅ |
| E | correct `_BRANCH` only | success + warning | ✅ |
| F | correct `_BRANCH` + correct `_SHA` | success + SHA-pinned warning
| ✅ |
| G | correct `_BRANCH` + wrong `_SHA` (40-hex zeros) | cosign rejects |
✅ |
| H | correct `_BRANCH` + invalid `_SHA` (`not-a-sha`) | validation
reject | ✅ |

Semgrep `shell-curl-pipe-to-shell` rule: 0 findings (curl|sh example was
removed from error text).

## Test plan

- [x] Manual: all 8 paths verified locally
- [x] Semgrep CI clean (was failing on the previous commit — fixed by
removing the curl|sh example string from error text)
- [ ] CI: branch-preview.yaml run on this fix branch confirms the
rebuilt sc.sh still works for production installs
- [ ] After merge + a push.yaml release: downstream consumers
(integrail/devops install-sc, PAY-SPACE wrappers) gain the opt-in path
automatically; PR #277's preview build becomes installable via `_BRANCH`
+ `_SHA` pin

## Refs

- PR #277 — the trigger for needing this; its preview build is the first
artifact unblocked by this change
- Round-2 review comment with full triage:
#278 (comment)

---------

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
Cre-eD added 2 commits May 20, 2026 12:59
Round-3 polish driven by 14-day production data from everworker (n=232 fires)
plus a 107-min post-apply observation window. Two changes — both validated
against codex + gemini security review.

## Change 1: split kmsKeyPolicyChanges → kmsKeyPolicy + kmsKeyGrants

Production observation: in 107 min post-apply, integrail-deployer-bot
(the Pulumi CI principal) generated 27 CreateGrant + 19 RetireGrant events
on KMS — natural deploy churn (every encrypted resource needs a grant).
PutKeyPolicy events: 0. The merged detector with default threshold 1 would
either flood the security-critical channel on every deploy OR mask a real
PutKeyPolicy event under a high threshold tuned for Pulumi noise.

Split into two detectors with separate signal models:

- kmsKeyPolicy (default ON, threshold 1)
  - Matches: PutKeyPolicy
  - "Who can use this key" is a structural security-boundary change. Rare
    even in active orgs; page on any occurrence.

- kmsKeyGrants (default OFF, threshold 10/300s)
  - Matches: CreateGrant / RetireGrant / RevokeGrant
  - High-volume in IaC-driven environments. Default off so most consumers
    won't see noise; opt-in for orgs with specific audit needs, with the
    expectation they'll add a principal exclusion override.

Backwards-compat: the kmsKeyPolicyChanges selector is GONE (not aliased).
Consumers using the old name will get a loud `validateOverrides` error at
deploy time pointing at the new keys. SC #277 is not yet released so the
BC break has zero downstream impact.

Codex + gemini both recommended split; aliasing was discussed and rejected
since there are no existing consumers to protect.

## Change 2: tighten unauthorizedApiCalls default

Two changes to the base filter + default threshold:

a) Default-exclude userIdentity.type = "AWSService"
   60% of historical AccessDenied events in our production sample were AWS
   service-linked roles (Macie, AWS Config, ResourceExplorer, Inspector)
   probing optional services for capability discovery. Zero security
   value: an attacker who compromises a service-linked role surfaces as
   AssumedRole in CloudTrail, not AWSService (service-linked role sessions
   never directly produce customer-side API calls in attacker scenarios).
   Excluding at base saves every downstream consumer from re-implementing
   the same override.

b) Default-exclude userIdentity.type = "AWSAccount"
   AWSAccount-type AccessDenied events are covered by the dedicated
   anonymousProbes detector (same threshold 10/300s). Making the two
   detectors disjoint avoids double-paging on the same event class in
   the high-volume case. Low-volume cross-account access (stale trust,
   third-party integration failure) was below threshold in either
   detector anyway, so no signal is lost.

c) Raise default threshold from 5 to 10
   CIS Benchmark "1 AccessDenied = alert" is too noisy for any active
   AWS account. 5 was an arbitrary middle-ground in the original PR;
   production data showed 10/300s absorbs natural permission-evaluation
   noise (eventual consistency, mistyped CLI commands, optional-service
   self-probes after the AWSService exclude) without missing real bursts.
   Aligned with anonymousProbes' threshold for symmetry.

NOT EXISTS guard preserved on both type clauses so events where
$.userIdentity.type is absent don't get silently dropped.

## Tests (28 total, was 23)

Five new tests lock in the new defaults:

- TestUnauthorizedApiCalls_DefaultThresholdIsTen
- TestUnauthorizedApiCalls_DefaultExcludesAWSService
- TestUnauthorizedApiCalls_DefaultExcludesAWSAccount
- TestKmsKeyPolicy_HighSignalDefault (PutKeyPolicy only, no grants)
- TestKmsKeyGrants_HighVolumeDefault (grants only, threshold 10, no PutKeyPolicy)
- TestKmsKeyPolicyChanges_OldNameRemoved (validateOverrides catches old YAML)

totalDetectors bumped 22 → 23 (kms split). Reflection test asserts every
bool selector wires through enabledAlerts.

## What consumers need to update

YAML (`.sc/stacks/<stack>/server.yaml`):

OLD:                                NEW:
  kmsKeyPolicyChanges: true           kmsKeyPolicy: true     # default-on, page on PutKeyPolicy
                                      # kmsKeyGrants: false  # opt-in for grant-lifecycle audit

  overrides:                          overrides:
    unauthorizedApiCalls:               unauthorizedApiCalls:
      threshold: 10                       # remove — plugin default is now 10
      excludeUserTypes: ["AWSService"]    # remove — plugin default now excludes AWSService + AWSAccount

Migration captured in the devops PR #428 update.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant